Clinical characteristics and complication risks in data‐driven clusters among Chinese community diabetes populations

Abstract
Background: Novel diabetes phenotypes have been proposed in European populations through cluster analysis, but Chinese community diabetes populations might exhibit different characteristics. This study aims to explore the clinical characteristics of novel diabetes subgroups under data‐driven analysis in Chinese community diabetes populations.
Methods: We used K‐means cluster analysis in 6369 newly diagnosed diabetic patients from eight centers of the REACTION (Risk Evaluation of cAncers in Chinese diabeTic Individuals) study. The cluster analysis was performed based on age, body mass index, glycosylated hemoglobin, homeostatic modeled insulin resistance index, and homeostatic modeled pancreatic β‐cell functionality index. The clinical features were evaluated with analysis of variance (ANOVA) and the chi‐square test. Logistic regression analysis was used to compare chronic kidney disease and cardiovascular disease risks between subgroups.
Results: Overall, 2063 (32.39%), 658 (10.33%), 1769 (27.78%), and 1879 (29.50%) participants were assigned to the severe obesity‐related and insulin‐resistant diabetes (SOIRD), severe insulin‐deficient diabetes (SIDD), mild age‐associated diabetes mellitus (MARD), and mild insulin‐deficient diabetes (MIDD) subgroups, respectively. Individuals in the MIDD subgroup had a low risk burden equivalent to prediabetes, but with reduced insulin secretion. Individuals in the SOIRD subgroup were obese, had insulin resistance, and had a high prevalence of fatty liver, tumors, family history of diabetes, and family history of tumors. Individuals in the SIDD subgroup had severe insulin deficiency, the poorest glycemic control, and the highest prevalence of dyslipidemia and diabetic nephropathy. Individuals in the MARD subgroup were the oldest and had moderate metabolic dysregulation and the highest risk of cardiovascular disease.
Conclusion: The data‐driven approach to differentiating the status of new‐onset diabetes in the Chinese community was feasible. Patients in different clusters presented different characteristics and risks of complications.


| INTRODUCTION
Diabetes is a chronic disease that the International Diabetes Federation (IDF) considers to be one of the fastest-growing diseases of this century. 1 According to the 2021 IDF report, approximately 537 million people worldwide have diabetes, a number expected to increase to 643 million by 2030 and 783 million by 2045. 2 It was reported that 1 in 12 all-cause deaths is attributable to diabetes. 3 Unfortunately, despite the availability of effective therapies, the prognosis for diabetes is poor, and patients with diabetes experience serious microvascular and macrovascular complications and premature death more frequently than the general population. 4 The most commonly used definitions of diabetes are based on fasting blood glucose (FBG), postprandial blood glucose (PBG), and glycosylated hemoglobin (HbA1c) levels 5; however, these criteria fail to provide therapeutic guidance for diabetes. 6 Indeed, numerous studies have shown that adult diabetes is a highly heterogeneous metabolic disease with multiple underlying mechanisms. 7,8 Thus, a more refined classification of diabetes could identify those patients at greatest risk for complications at diagnosis and enable personalized treatment protocols. 9
In 2018, Ahlqvist et al. 9 studied 8980 newly diagnosed patients with diabetes aged 18 years or older in Sweden and classified them into five novel subgroups based on anti-glutamic acid decarboxylase antibody (GADA), age at diagnosis, body mass index (BMI), HbA1c, homeostatic modeled insulin resistance index (HOMA-IR), and homeostatic modeled pancreatic β-cell functionality index (HOMA-β). GADA-positive patients were defined as having severe autoimmune diabetes (SAID), and GADA-negative patients were further categorized into severe insulin-deficient diabetes (SIDD), severe insulin-resistant diabetes (SIRD), mild obesity-associated diabetes (MOD), and mild age-associated diabetes mellitus (MARD) subgroups by data-driven cluster analysis. They found that the MARD and MOD subgroups had relatively good metabolic control and relatively few diabetic complications, whereas the SAID, SIDD, and SIRD subgroups had relatively poor clinical outcomes. In recent years, many studies 6,10-18 have used data-driven cluster analysis to divide diabetic populations into novel subgroups, and the characteristics and complication risks of those subgroups are very similar to those in the study by Ahlqvist et al. 9
At present, China has the largest number of people with diabetes of any country. 6 It was reported that China accounts for 25% of the world's total diabetes population. 19 Moreover, Chinese patients with diabetes have characteristics distinct from those of other populations. 20 Therefore, exploring the unique characteristics of data-driven clusters in Chinese diabetic populations has important clinical significance. Hence, the current study aims to explore the clinical characteristics of novel diabetes subgroups under data-driven analysis in a large-sample, multicenter Chinese community population.

| Database introduction
In the current study, participants were drawn from eight centers of the REACTION (Risk Evaluation of cAncers in Chinese diabeTic Individuals) study. 21 Overall, 53 639 participants aged 40 years or older were surveyed between March and December 2012. The research program was authorized by the Human Research of Ruijin Hospital. Before data collection, every participant provided written informed consent. First, we included a total of 13 364 people who met the American Diabetes Association (ADA) criteria for diabetes (FBG ≥7.0 mmol/L or PBG ≥11.1 mmol/L or HbA1c ≥6.5%) 22 in the current survey. Next, we excluded 2670 subjects who self-reported a previous diagnosis of diabetes or who were already using glucose-lowering medications or insulin; 2864 subjects with missing important data (such as insulin, biochemical markers, body measurements, personal history of disease, or family history of disease); 12 patients with extreme outliers (>5 SDs from the mean) 9; 583 subjects with a previous diagnosis of chronic kidney disease; and 866 subjects with acute diseases that might affect glucose metabolism (such as pancreatitis, liver cirrhosis, or viral hepatitis). Ultimately, the present study included 6369 participants with newly diagnosed diabetes.
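The participant flow and the ADA inclusion rule described above can be checked with a few lines of code. The following is only an illustrative sketch, not the authors' screening code: the function name, argument names, and dictionary labels are hypothetical, and the chronic kidney disease exclusion is taken as 583 subjects, the count under which the reported numbers sum to 6369.

```python
def meets_ada_criteria(fbg_mmol_l, pbg_mmol_l, hba1c_percent):
    """ADA thresholds quoted in the text: any one criterion is sufficient."""
    return fbg_mmol_l >= 7.0 or pbg_mmol_l >= 11.1 or hba1c_percent >= 6.5


# Counts reported in the text; the keys are illustrative labels only.
met_ada_criteria = 13364
exclusions = {
    "previously diagnosed or on glucose-lowering treatment": 2670,
    "missing important data": 2864,
    "extreme outliers (>5 SDs from the mean)": 12,
    "previous diagnosis of chronic kidney disease": 583,
    "acute disease affecting glucose metabolism": 866,
}


def remaining_after_exclusions(start, excluded):
    """Subtract each exclusion in order and return (reason, remaining) pairs."""
    remaining = start
    flow = []
    for reason, count in excluded.items():
        remaining -= count
        flow.append((reason, remaining))
    return flow


flow = remaining_after_exclusions(met_ada_criteria, exclusions)
final_n = flow[-1][1]  # number of participants entering the cluster analysis
```

Under these counts, `final_n` matches the 6369 participants analyzed in the study.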

| Data collection
Participants completed the questionnaire in a face-to-face interview with a trained staff member. The questionnaire covered year of birth, personal and family medical history, current medication use, smoking and drinking habits, age, occupation, and physical activity.
After the participants removed their coats and shoes, their height, weight, waist circumference (WC), and hip circumference (HP) were measured and recorded by a trained staff member. Systolic blood pressure (SBP) and diastolic blood pressure (DBP) were measured three times at 5-min intervals by the same staff member using a calibrated automatic electronic device. The average of these three measurements was used in the statistical analysis.
Blood samples were collected at 8-9 a.m. after an 8-10 h overnight fast. Participants without or with diabetes underwent a 75 g oral glucose tolerance test (OGTT) or a 100 g steamed-bread meal test, respectively, and blood samples were collected at 0 and 2 h. FBG and PBG were measured by the glucose oxidase method on an autoanalyzer. HbA1c was measured by high-performance liquid chromatography using the VARIANT II Hemoglobin Testing System. Fasting insulin was measured with a chemiluminescent immunoassay. Serum creatinine (SCr) was determined by the picric acid method. Serum triglyceride (TG), total cholesterol (TC), low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol (HDL-C), aspartate aminotransferase (AST), alanine aminotransferase (ALT), γ-glutamyl transferase (GGT), and other biochemical indexes were determined using autoanalyzers.
Spot morning urine samples were collected. Urinary creatinine was determined by Jaffe's kinetic method. Urinary albumin was determined by an immunoturbidimetric assay.
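The fasting insulin and FBG measurements described above are the inputs to the HOMA-IR and HOMA-β indices later used as clustering variables. The text does not state which formulas were applied, so the sketch below assumes the commonly used original HOMA approximations (insulin in µU/mL, glucose in mmol/L); the function and parameter names are hypothetical.

```python
def homa_ir(fasting_insulin_uU_ml, fbg_mmol_l):
    """Assumed formula: HOMA-IR = fasting insulin x fasting glucose / 22.5."""
    return fasting_insulin_uU_ml * fbg_mmol_l / 22.5


def homa_beta(fasting_insulin_uU_ml, fbg_mmol_l):
    """Assumed formula: HOMA-beta = 20 x fasting insulin / (fasting glucose - 3.5)."""
    if fbg_mmol_l <= 3.5:
        # The approximation is undefined (or negative) at or below 3.5 mmol/L.
        raise ValueError("HOMA-beta is not defined for FBG <= 3.5 mmol/L")
    return 20.0 * fasting_insulin_uU_ml / (fbg_mmol_l - 3.5)
```

If the study used a different variant (for example, the computer-based HOMA2 model), the values would differ from these closed-form approximations.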

| Definitions
Hypertension was defined as SBP ≥140 mmHg or DBP ≥90 mmHg 23 or a self-reported prior diagnosis of hypertension. Dyslipidemia was defined as TC ≥6.2 mmol/L or LDL-C ≥4.1 mmol/L or TG ≥2.3 mmol/L or HDL-C ≤1.0 mmol/L 24 or a self-reported previous diagnosis of lipid disorders. Chronic kidney disease (CKD) was defined as an estimated glomerular filtration rate (eGFR) <60 mL/min per 1.73 m² or a urinary albumin-to-creatinine ratio (UACR) ≥30 mg/g. 6 Cardiovascular disease was self-reported by patients at the time of the questionnaire survey. 25
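The definitions above translate directly into boolean rules. The sketch below encodes the thresholds exactly as quoted; the function and parameter names are hypothetical, and eGFR and UACR are assumed to be already computed in the units given in the text.

```python
def has_hypertension(sbp_mmhg, dbp_mmhg, self_reported=False):
    """SBP >= 140 mmHg or DBP >= 90 mmHg or a self-reported prior diagnosis."""
    return sbp_mmhg >= 140 or dbp_mmhg >= 90 or self_reported


def has_dyslipidemia(tc, ldl_c, tg, hdl_c, self_reported=False):
    """Lipid values in mmol/L; any single abnormal value is sufficient."""
    return (
        tc >= 6.2
        or ldl_c >= 4.1
        or tg >= 2.3
        or hdl_c <= 1.0
        or self_reported
    )


def has_ckd(egfr_ml_min_173m2, uacr_mg_g):
    """eGFR < 60 mL/min per 1.73 m^2 or UACR >= 30 mg/g."""
    return egfr_ml_min_173m2 < 60 or uacr_mg_g >= 30
```

Because each rule is a disjunction, a participant with normal blood pressure readings is still classified as hypertensive when a prior diagnosis is self-reported.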

| Statistical analysis
The K-means cluster analysis was performed with the same clustering parameters (age, BMI, HbA1c, HOMA-IR, and HOMA-β) as the study by Ahlqvist et al. 9 Box-Cox transformations were applied to variables that were not normally distributed. 26,27 All values were centered to a mean of 0 and an SD of 1. The optimal number of K-means clusters (K value) was selected as 4 using the silhouette method. 9 K-means cluster analysis was run with randomly selected initial cluster centers (runs = 100). The cluster analysis was performed using R software (v4.2.2) with the packages "MASS," "factoextra," and "cluster." We named the subgroups identified in this study based on previous research and their distinguishing characteristics. To further explore the impact of the clustering features on the cluster results, and with reference to published research, we included PBG as an additional feature alongside the five features from the study by Ahlqvist et al. So that the two clustering results could be compared, clustering on the six features was performed directly with K = 4, and all other settings were the same as in the original clustering. The clustering results of the two feature sets were matched using a Sankey plot.
Differences between subgroups were compared using one-way ANOVA (for normally distributed variables), the Kruskal-Wallis test (for non-normally distributed variables), and the chi-square test with multiple comparisons, as appropriate. Binary logistic regression analysis was used to compare the risk of CKD and cardiovascular disease among the different subgroups, taking the subgroup with the lowest complication rate as the reference. Three models were established for the logistic regression analysis to control for possible confounders. Model 1 was adjusted for age and center; model 2 was additionally adjusted for occupation, education, marriage, smoking habits, drinking habits, physical activity, and family history of diabetes; and model 3 was additionally adjusted for TG, TC, HDL-C, LDL-C, ALT, AST, GGT, SBP, and DBP. SPSS 24.0 (IBM, Chicago, IL) was used for this part of the statistical analysis. All statistical tests were two-sided, and p < 0.05 was considered statistically significant.
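The standardization and repeated-start K-means steps described in this section can be sketched as follows. This is not the authors' R code: it is a minimal standard-library Python illustration with hypothetical function names, it uses plain Lloyd iterations, and it omits the Box-Cox transformation and the silhouette-based choice of K.

```python
import random
import statistics


def zscore_columns(rows):
    """Center every column to mean 0 and scale it to (sample) SD 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_once(rows, k, rng, max_iter=100):
    """One K-means run from k randomly chosen observations as initial centers."""
    centers = rng.sample(rows, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(row, centers[j]))
            for row in rows
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [statistics.fmean(c) for c in zip(*members)]
    inertia = sum(squared_distance(row, centers[lab]) for row, lab in zip(rows, labels))
    return labels, inertia


def kmeans(rows, k, runs=100, seed=0):
    """Repeat K-means from random starts; keep the run with the lowest inertia."""
    rng = random.Random(seed)
    return min((kmeans_once(rows, k, rng) for _ in range(runs)), key=lambda r: r[1])
```

In this setting each row would hold one participant's (transformed) age, BMI, HbA1c, HOMA-IR, and HOMA-β, and the call would be `kmeans(zscore_columns(rows), k=4, runs=100)`. Keeping the lowest-inertia run is one common way to use multiple random starts; the text does not say how the 100 runs were combined.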

| Differences in lifestyle habits, personal, and family history of disease between clusters
In Table 3, the SIDD (47.87%) and MIDD (40.18%) clusters had a relatively high proportion of males, and the SOIRD cluster had the lowest proportion of males (29.76%). Participants assigned to SIDD were more likely to have smoking and drinking habits than those in the other clusters. Family history of diabetes was most common in the SOIRD subgroup (21.93%) and least common in the MARD subgroup (10.25%). Patients in the MARD subgroup (12.83%) had the highest prevalence of cardiovascular disease. Self-reported fatty liver was most common in the SOIRD subgroup (62.34%). Self-reported retinopathy was most common in the MARD subgroup (0.79%). Patients in the SOIRD subgroup reported the most tumor history (3.11%) and family history of tumors (15.29%) compared with those in the other subgroups.

| Risk of CKD and cardiovascular disease by subgroups
Table 4 shows the results of the logistic regression analysis. The MIDD subgroup, which had the lowest prevalence of CKD (14.0%), was used as the reference group. In model 3, after adjusting for the confounding factors, participants in the SIDD (odds ratio [OR], 2.240; 95% confidence interval [CI], 1.755-2.860; p < 0.001), SOIRD (OR, 1.383; 95% CI, 1.138-1.681; p = 0.001), and MARD (OR, 1.619; 95% CI, 1.328-1.973; p < 0.001) subgroups all showed significantly higher risks of CKD than those in the MIDD subgroup. The MIDD cluster also had the lowest prevalence of cardiovascular disease (3.73%) and was used as the reference group for that outcome. In model 3, after adjusting for the confounding factors, participants in the SOIRD (OR, 1.537; 95% CI, 1.105-2.139; p = 0.011) and MARD (OR, 1.564; 95% CI, 1.120-2.185; p = 0.009) subgroups showed significantly higher risks of cardiovascular disease than those in the MIDD subgroup; however, there was no statistically significant difference in the risk of cardiovascular disease between the SIDD and MIDD subgroups (p = 0.894). S1 similarly showed that the characteristics of the typologies obtained by adding PBG as a clustering variable are essentially similar to those obtained from analyses conducted in the present study.
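A quick consistency check can be applied to odds ratios reported with 95% CIs. The sketch below assumes the intervals are Wald intervals computed on the log-odds scale (an assumption; the text does not name the interval type), in which case the OR is the geometric mean of its two bounds. The function names are hypothetical.

```python
import math


def implied_odds_ratio(ci_lower, ci_upper):
    """For a Wald interval on the log scale, OR = sqrt(lower * upper)."""
    return math.sqrt(ci_lower * ci_upper)


def implied_log_or_se(ci_lower, ci_upper, z=1.96):
    """Standard error of log(OR) recovered from the width of a 95% CI."""
    return (math.log(ci_upper) - math.log(ci_lower)) / (2 * z)


# (reported OR, lower bound, upper bound) for CKD in model 3, from the text.
reported_ckd = {
    "SIDD": (2.240, 1.755, 2.860),
    "SOIRD": (1.383, 1.138, 1.681),
    "MARD": (1.619, 1.328, 1.973),
}

implied_ckd = {
    name: implied_odds_ratio(lower, upper)
    for name, (_, lower, upper) in reported_ckd.items()
}
```

Under this assumption the three CKD estimates agree with their intervals to within rounding, and the cardiovascular interval for SOIRD (1.105-2.139) implies a point estimate of about 1.537, the value quoted in the Discussion.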

| DISCUSSION
Unlike the Ahlqvist research, the present study identified a specific cluster, SOIRD, which had appeared in European non-newly diagnosed diabetic populations 28  controversy about the variables used to generate patient subtypes. 4,29 In fact, the OGTT and the postprandial islet function test are still expensive and inconvenient for routine examination; our current clustering method could use fewer variables to obtain cluster results comparable to those in the Shanghai community diabetic populations and Indian diabetic populations, which would save both time and cost. The MARD subgroup tended to be the most prevalent in previous studies, 4 whereas the SOIRD cluster was the most prevalent in the current study, which might be attributed to differences in ethnicity and differences between inpatients and patients in the community; in addition, the participants in our study were residents over 40 years of age (the common age of onset of diabetes), whereas most previous studies have focused on adults over 18 years old; the difference in age could also lead to a difference in category.
Note: Model 1 was adjusted for age and center; model 2 was additionally adjusted for occupation, education, marriage, smoking habits, drinking habits, physical activity, and family history of diabetes; and model 3 was additionally adjusted for TG, TC, HDL-C, LDL-C, ALT, AST, GGT, SBP, and DBP.
Abbreviations: ALT, alanine aminotransferase; AST, aspartate aminotransferase; DBP, diastolic blood pressure; GGT, γ-glutamyl transferase; HDL-C, high-density lipoprotein cholesterol; LDL-C, low-density lipoprotein cholesterol; MARD, mild age-associated diabetes mellitus; MIDD, mild insulin-deficient diabetes; SBP, systolic blood pressure; SIDD, severe insulin-deficient diabetes; SOIRD, severe obesity-related and insulin-resistant diabetes; TC, total cholesterol; TG, triglyceride.
The SOIRD subgroup had the largest WC. Unlike BMI, WC is more responsive to visceral adipose tissue mass, 30 and a previous study 31 found that the amount of visceral adipose tissue is higher in Asians than in Westerners at similar BMI, leading to greater insulin resistance in Asian patients with diabetes. We also observed the lowest HDL-C and highest TG in the SOIRD cluster; a strong association between dyslipidemia and insulin resistance has long been established. 32,33 Besides, the SOIRD cluster had the highest ALT and AST and the highest prevalence of self-reported fatty liver, which is consistent with previous studies. 6,15 Moreover, benefiting from the detailed questionnaire, the present study also found that the SOIRD category had the highest rates of tumor, family history of diabetes, and family history of tumors. We hypothesized that unhealthy lifestyle habits among family members influence each other, such as poor dietary habits causing obesity, and insufficient exercise 34 and obesity 31 leading to insulin resistance; therefore, diabetic patients in the SOIRD cluster exhibit familial aggregation. However, the causal relationship between poor lifestyle and the SOIRD subgroup could not be established in this study, and we hope that prospective cohort studies will be conducted in the future to clarify whether correcting poor dietary habits and exercising sensibly could improve insulin resistance and glucose metabolism abnormalities in diabetes populations classified as SOIRD. In addition, there is evidence that obesity drives morphological and functional changes in cancer cells through complex interactions within the tumor microenvironment, 35 while hyperinsulinemia promotes tumor development 36; on the other hand, insulin resistance is closely associated with visceral fat dysfunction and systemic inflammation, both of which are conducive to the establishment of a protumorigenic milieu. 36 Together, these mechanisms could explain the high prevalence of tumor history in the SOIRD subgroup.
In the current study, a new diabetes subgroup, MIDD, was identified, which had lower HOMA-β (higher only than the SIDD subgroup and significantly lower than the other subgroups), suggesting insufficient insulin secretion. In addition, HbA1c and FBG in the MIDD subgroup were the lowest, with the medians in the prediabetic range, whereas the median PBG met the diagnostic criteria for diabetes, which is typical of Chinese diabetes populations, as the diet of the Chinese population is predominantly high in carbohydrates. 37 The MIDD subgroup had the lowest BMI and WC, both of which were in the normal range, and other biochemical indices were better than those of the other subgroups, suggesting that metabolic control was relatively good in this subgroup. In fact, compared with Caucasian populations, Asian populations with diabetes tend to be thinner but have more impaired pancreatic function 38 because Asians have an innate deficiency in insulin secretion capacity. 39 Identifying MIDD as a subgroup of diabetes in the Chinese population is clinically important because people with this type of diabetes tend to ignore their condition owing to mild or absent symptoms even though their pancreatic islet function has already been damaged; they are therefore advised to have regular medical checkups and to take reasonable measures for glycemic control in order to avoid further impairment of pancreatic islet function.
In agreement with previous studies, 6,9,10,15-17 the current study also found that the SIDD subgroup had the lowest HOMA-β, suggesting that pancreatic islet function had been severely damaged, and had the highest HbA1c, FBG, and PBG, suggesting poor glycemic control. Besides, we found the highest prevalence of dyslipidemia in this subgroup, indicating poorer metabolic control. Previous studies have shown that newly diagnosed Chinese diabetic populations with combined hyperlipidemia exhibit decreased insulin secretion rather than impaired insulin sensitivity, 40 which might be due to dyslipidemia leading to elevated circulating free fatty acid levels, resulting in impaired β-cell function. 40 There is evidence that antihyperlipidemic drugs improve β-cell function in diabetic populations. 41 Given the highest prevalence of dyslipidemia in the SIDD subgroup of this study, we consider that stricter lipid control in this population might be a new therapeutic approach. We also found that the SIDD subgroup had the highest UACR, and previous studies 6,18 have similarly found that this subgroup has the highest risk of albuminuria. In addition, logistic regression analysis showed that, even after adjusting for confounders, the risk of CKD was still highest in the SIDD subgroup, at 2.24 times that of the reference (MIDD) subgroup. The causal relationship between abnormal β-cell function and the development of albuminuria has been demonstrated. 42 Furthermore, we found that the SIDD subgroup had the largest percentage of smokers and drinkers, followed closely by the MIDD subgroup. Numerous population-based studies 43-46 have shown a strong association between smoking and alcohol consumption and impaired β-cell function. Studies using rodent models have shown that nicotine exposure could mediate β-cell dysfunction, pancreatic β-cell apoptosis, and β-cell mass loss through mitochondrial or death receptor pathways. 47 Studies in rats have also shown that chronic high-dose alcohol intake could lead to altered insulin, messenger RNA (mRNA) gene expression, and impaired pancreatic β-cell function. 48 This might also explain the high percentage of men in the SIDD and MIDD subgroups, as men are more likely to be smokers and drinkers and such habits decrease insulin secretion. We recommend education on smoking and alcohol cessation for diabetic populations classified into the SIDD or MIDD subgroups to avoid further decline in pancreatic β-cell function.
Consistent with previous studies, 4,6,9,11,14-16,18,28 the MARD subgroup in the current study was the oldest, had moderate metabolic dysregulation, and had the lowest prevalence of family history of diabetes. 6 The MARD subgroup had the highest prevalence of hypertension and the highest SBP; however, it is noteworthy that DBP was lower than average in this subgroup (only slightly higher than that in the SIDD cluster), suggesting that people in this subgroup should be alert not only to the risk of hypertension but also to the risk of increased arterial stiffness indicated by a high pulse pressure. 49 In logistic regression analyses, we found that, even after adjusting for numerous biochemical and lifestyle confounders, the MARD subgroup had the highest risk of cardiovascular disease, at 1.564 times that of the reference subgroup. Advanced age is a well-known risk factor for cardiovascular disease 50; therefore, even though patients in the MARD subgroup did not have the worst glycemic levels and metabolic control, their risk of developing major cardiovascular disease was the highest. Interestingly, in logistic regression analyses, the SOIRD subgroup also had a high cardiovascular risk, at 1.537 times that of the reference subgroup. It has been demonstrated that the degree of insulin resistance could independently influence macrovascular complications of diabetes 51; consequently, even after adjusting for confounders, the risk of cardiovascular disease was high in the SOIRD subgroup, although age was lowest in this subgroup.
The current study is the first to cluster and type a large-sample, multicenter population of Chinese community patients with diabetes. The current study population is more generally representative than patients attending hospitals. In addition, we have detailed information on possible confounding factors. Furthermore, benefiting from the detailed questionnaire in the present study, we could explore differences in lifestyle habits and family history of disease among the various subgroups, which has rarely been explored in previous studies. This also allows us to provide more individualized advice to middle-aged and elderly Chinese community diabetic populations; for instance, SOIRD populations should change their poor dietary habits and exercise more, SIDD and MIDD populations need to reduce smoking and drinking, and MARD populations should focus not only on blood glucose but also on the risk of hypertension and atherosclerosis. However, the current study still has some limitations. First, GADA was not measured or included in the cluster analysis in the present study. However, the prevalence of GADA-positive diabetes among Chinese adults with diabetes is likely to be less than 5.9% 6,10 and might be even lower in population-based screening because GADA-positive diabetic populations are usually diagnosed before they are captured by screening owing to their acute diabetic complications. 10 Second, the subjects in the current study were aged 40 years and older; however, this is the common age for the onset of diabetes. 52 Future studies should cluster and type community diabetes populations in broader age groups. Third, the present study is cross-sectional, and a previous investigation 53 reported that the subgroup assignment of diabetic populations could change over the course of the disease; however, in the present study, we could not determine whether the current subgroups shift with disease progression. In addition, without follow-up data, we were unable to investigate whether the clustering method in the present study was superior to prediction methods based on baseline variables or supervised machine learning methods in its ability to predict patients' future complications and disease severity. Moreover, owing to the lack of follow-up data on these patients newly diagnosed with diabetes in the present epidemiologic survey, we do not know the effect of glucose-lowering drugs on complications in these subgroups or the proportion of deaths from complications in the different subgroups. To expand our study and better assist in the clinical management of diabetes mellitus, long-term follow-up studies are essential to understand illness progression and therapy response.

| CONCLUSIONS
In the current study, the data-driven approach to differentiating the status of new-onset diabetes in the Chinese community was reproducible, and the distribution of patients was similar, but not identical, to that of the study by Ahlqvist et al. We are the first to report the MIDD subgroup, which had a low risk burden equivalent to prediabetes, but with reduced insulin secretion. The SOIRD subgroup was characterized by obesity and insulin resistance and had a high prevalence of fatty liver, tumors, family history of diabetes, and family history of tumors. The SIDD subgroup had severe insulin deficiency, the poorest glycemic control, and the highest prevalence of dyslipidemia and CKD. The MARD subgroup was the oldest and had moderate metabolic dysregulation and the highest risk of cardiovascular disease and hypertension. The findings of the current study could contribute to improving early prevention and targeted treatment of diabetes in Chinese community populations.

Figure S1 showed that adding or not adding PBG as a clustering variable resulted in a difference in clustering
TABLE 3
TABLE 4 Association between diabetes-related complications and subgroups.